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INTRODUCTION 



1 . 

The purpose of This paper is To presenT several FORTRAN 
subrouTines Tor updaTing The QR decompos i T i on oT a maTrix. LeT 
A G , m > n, have a QR decompos i T i on A = QR, where Q G has 

orThonormal columns, and R G R^^^ is upper Triangular. Assume ThaT 
The elemenTs of Q and R are expliciTly known. LeT A G R^^^ , P > q, 
be obTained from A by inserTing or deleTing a row or a column, or leT 
A be a rank -one mod i T i caT ion of A , i . e . , A = A -I- vu"*" , where u G R^, 

V G R^ . Then a QR-decompos i T i on of A, A = QR , where Q G R^^^ has 
orThonormal columns and R gR^^^ is upper Triangular, can be compuTed 
in 0(mn) ariThmeTic operaTions by updaTing Q and R; see Daniel eT al . 
[5] . The updaTing is done by applying Givens reflecTors. The 
operaTion counT Tor updaTing Q and R compares favorably wiTh The 
O(mn^) ariThmeTic operaTions necessary To compuTe a QR decompos i T i on 
of a general mxn maTrix. 

Algol procedures Tor compuTing Q and R from Q and R are presenTed 
by Daniel eT al . [5] . Buckley [2] TranslaTed These procedures inTo 

FORTRAN. Our FORTRAN subrouTines implemenT mod if i caT i ons of The 
Algol pi^ocedures in [5] . These mod i T i caT i ons speed up The 
subrouTines and make Them suiTable Tor use on vecTor compuTers. This 
is illusTraTed by Timing experimenTs. 

Several program libraries, such as LINPACK [6] and NAG [14] , 
provide subrouTines Tor updaTing R only, buT conTain no rouTines Tor 
updaTing The compleTe QR decompos i T i on . AdvanTages of updaTing boTh 
Q and R include ThaT downdaTing can be carried ouT sTably, and ThaT 
The individual elemenTs of projecTions are easily accessible; see 
LINPACK [6, p. 10.23] , Daniel eT al . [5] , and STewarT [17] . 
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The first, c omp re liens i ve survey of updating algor ithrir^ was 
presented by Gill et al. [8] , and a recent discussion with references 
to applications can be found in Golub and V'an Loan [10, Chapter 
12.6] . The applications include linear least squares problems, 
regression analysis, and the solution of nonlinear systems of 
equations. See Allen [l] , Goldfarb [9] , Gragg and Stewart [11] . More 
and Sorensen [13] . The algorithms would also appear to be applicable 
to recursive least squares problems of signal processing; see Ling et 
al . [12] . 

We also present a subroutine which implements the rank revealing 
QR decomposition method recently' proposed by Chan [3] . In this 
method the QR decomposition A = QR is updated to yield the QR 
decomposition A = QR , where A is obtained from A by column 
permutation. This permutation is selected so that, in general . the 
element(s) in the lower right corner of R are small if A has nearl\^ 
linearly dependent columns. The subroutine can be used to solve the 
subset selection problem; see Golub and V'an Loan [10] . Table 1.1 
lists the FORTRAN subroutines for updating the QR decomposition. All 
subroutines use double precision arithmetic and are written in 
FORTRAN 77. Section 2 contains programming details for the 
subroutines of Table 1.1 and for certain auxiliary subprograms. For 
all subroutines of Table 1.1. except DRRPM , the numerical method as 
well as Algol procedures have been presented in [5] . For these 
subroutines we will only discuss differences between our FORTRAN 
subroutines and the Algol procedures. These differences stem in part 
from the algorithms being sped up. as well as from the use of simple 
subroutines, BLAS , for elementai'y vector and matrix operations. 
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Sub rout: i ne 



Purpose 



DDELC 



Computes Q,R ^rom Q,R when A is obtained i^rom A by 
deleting a column; see [5] . 



DDELR 



Computes Q,R -from Q,R when A is obtained -from A by 
deleting a row; see [5] . 



DINSC 



Computes Q,R i^rom Q,R when A is obtained i^rom A by 
inserting a column; see [5] . 



DINSR 



Computes Q,R i^rom Q,R when A is obtained from A by 
inserting a row; see [5] . 



DRNKl 



Computes Q,R from Q,R when A is a rank-one 
modification of A; see [5] . 



DRRPM 



Computes Q,R from Q,R when A is obtained by 
permuting the columns of A in a manner that 
generally reveals if columns of A are nearly 
linearly dependent; see [3] . 



Table 1.1: Subroutines for updating a QR decomposition A = QR 

to yield a QR decomposition A = QR. 

The BLAS are discussed in Section 3. They have been written to 
vectorize efficiently on a IBM 3090-200VF computer using the 
vectorizing compiler VS FORTRAN 2.3.0 without special compiler 
directives. Most BLAS were obtained by modifying L INPACK BLAS [6] . 

We hope that the provided BLAS vectorize well without excessive 
timing increases also on other vector computers. Section 4 contains 
output from a driver illustrating the use of the subroutines. A 
listing of the source code of the driver is provided in the Appendix. 
Section 4 also contains some timing results. 
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2 . THE UPDATING SUBROUTINES 

We consider The subrou Tines oT Table 1.1 in order. 71iese 
subrouTines use auxiliary subrouTines which we need To introduce 
f i rsT . The>" are 1 i sTed in Table 2.1. 



Auxiliary Called by 

subrouTine subrouTine 

DORTHO DINSC, DRNKl 



DORTHX DDELR 



DINVIT DRRPM 



DTRLSL DINVIT 



DTRUSL DINVIT 



Purpose 



CompuTe s; = Q^w . v' : = (I-QQ^)w 

wiTh reorThogonal i zaT i on Tor 
arbiTrary vecTor w. 



CompuTe s : = Q~*^e: 

wiTh reorThogonal 
vecTor ej . 



v: = (l-qq'^)e^. 
izaTion Tor axis 



CompuTe approximation oT a righT 
singular vecTor corresponding to a 
leasT singular value oT R. A Tirst 
approx i maT i on is obTained Trom the 
LINPACK condiTion number estimator 
DTRCQ , and is improved by inverse 
i Te rat ion. 



Solx'e lower Triangular system oT 
equations with TrequenT rescalings in 
order To avoid overTlow. Similar To 
part oT DTRCO. 

Solve upper Triangular linear sx'stem 
oT equations with TrequenT rescalings 
in order To avoid overTlow. Similar 
To part oT DTRCO. 



Table 2.1: Auxiliary subrouTines. 



2.1 SubrouTines DORTHO and DORTHX 

Given a matrix Q E , m > n, with orthonorinal columns and a 

vecTor w E , The subrouTine DORTHO computes the Fourier 
coeTTicients s: = Q^w and The orthogonal projection oT w into the 

null -space oT Q"^ , v: = (I-QQ^)w. At most one reo rt hogona 1 i zat i on is 
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carried out. Since the subroutine DORTHO differs from the 
corresponding Algol procedure ” orthogonal i ze” in [5] we discuss 
DORTHO and its use in some detail. 

Subroutine DORTHO is called by routine DINSC, which updates the 
QR factorization of a matrix A = QR E , m > n, when a column w is 

inserted into A. Updating may not be meaningful if w is nearly a 
linear combination of the columns of Q. Therefore DORTHO computes 
the condition number of the matrix Q: = [Q^w/llwfl] G ^ where the 

norm || || is the Euclidean norm. Using Q~*'Q = I, we obtain the 
following expressions for the singular values > 0^2 — • • • — ^n-|-i of 

Q: 



= (1 + ||Q‘"w||/|1w||) 1/2^ (2.1a) 

<Tj = 1 , 2 < j < n, (2.1b) 

<Tn+i = (1 - I|Q‘^w||/||w||)1/2. (2.1c) 

Further, for v: = ( I -QQ"^ ) ? 

I|v|| = + (2.2) 

Since 1 < < ^^2 , also an accurate estimate of the length of 

the orthogonal projection of w/||w|| into the null-space of . In 



order to avoid severe cancellation of significant digits in (2.1c) we 
determine first from (2.1a) and then from (2.2). 

Subroutines DINSC and DORTHO have an input parameter RCOND which 
is a lower bound for the reciprocal condition number. The 
computations are discontinued and an error flag is set if 
RCOND < . On exit, RCOND: = • 
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Assume now that the input value of RCOND > Then DORIHO 

computes s: = Q~’^w and v: = (I-QQ^)w by a scheme analogous to the 

method described by Parlett [15, p. 107] for orthogonal i z i ng a vector 
against another vector. For definiteness, we present the 
orthogonal i zat i on scheme. References to ^ and RCGND are 

neglected for simplicity. 

Orthogonal i zat i on al gor i t hm ; input Q G (Q has orthonormal 

columns), m,n (m > n), w G (w ^ 0) ; output v (v = (I-QQ~^)w), 

s (s = Q~*~w) ; 

w : = w/||w|| ; 

s: = Q~'~w ; v: = w-Qs; (2.3) 

if ||\^|| > 0.707 then 

v: = v/||v||; s: - s||w|| ; exit; >+< ||v|| = 1. Q"*~v ^ 0 ^ 

s': = Q“^v; v': = v-Qs' ; (2.4) 

i f ||v'|| < 0.707||v|| then 

* w lies in span{Q} numerically ^ 

v: =0; s: = (s+s')||w||; set flag; exit; 

v: = /IK'f s: = ( sfs') ||w|| ; exit; ^ ||v|| = 1. Q~^v = 0 ^ 

The proof in Parlett [15, pp . 107-*10S] that one reort hogona 1 i zat i on 

suffices carries over to the present algorithm, using that Q~^Q == I. 

We note that there are other ways to carry out the computations 
on lines (2.3)-(2.4). In [5], v and v' are updated irnmediatelx' after 
a component of s is computed. Our scheme has the advantages of being 
faster on vector computers, since it allows matrix vector operations, 
and it is also, general 1\', more accurate, since rounding errors 



accumulate less. The latter can easily be shown, and we omit the 
detai 1 s . 

We turn to subroutine DORTHX . This is a Taster version oT 
subroutine DORTHO . DORTHX assumes that w in the orthogonal i zat i on 

algorithm is an axis vector. This simpliTies the computations in 
(2.3) . DORTHX may perTorm nearly twice as Tast as DORTHO. 

2.2 Subroutines DINVIT, DTRLSL and DTRUSL 

Given a nonsingular upper triangular matrix U = ^ and a 

vector b = [/?j] G , DTRUSL solves Ux = bp, where \p\ < 1 is a 

scaling Tactor such that I I ^ 1 Tor all j. The scaling Tactor 
is introduced in order to avoid overTlow when solving very ill- 
conditioned linear systems oT equations. DTRLSL is an analogous 
subroutine Tor lower triangular systems. 

DTRLSL and DTRUSL are called by DINVIT, a subroutine Tor 
computing an approximation oT a right singular vector belonging to a 
least singular value oT a right triangular matrix R. IT R is 
singular then such a singular vector is computed by solving a 
triangular linear system oT equations. Otherwise an initial 
approximate right singular vector a^^^ = is obtained by the 

LINPACK condition number estimator DTRCO , and inverse iteration with 
R^R is used to obtain improved approximations j = 1 , 2 , . . . , NMBIT , 

where NMBIT is an input parameter to DINVIT and DRRPM . On exit Trom 
DINVIT and DRRPM, IPOS(j) contains the least index k such that 
I I ^ I I ’ ^ ^ ^ ^ ^ ^ J ^ NMBIT. On return Trom DINVIT and 

DRRPM the parameter DELTA is given by DELTA: = ||R"’'Ra^^^®*'^^||/||a^^*^®*'*'^|| . 
Hence. DELTA is an upper bound Tor the least singular value oT R. 
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2.3 Updating subroutines 

We are in a position to consider the subroutines of Table 1.1. 

The vector i zat i on is mainly done in tlie BLAS of the next section, but 
some loops of the subroutines of Table 1.1 v^ectorize as well. 

Comments in the source code reveal which loops v^ectorize or are 
eligible for v'ector i zat i on on an IBM 3090-200VF computer with 
compiler VS FORTRAN 2.3.0 to where the default v^ecto r i zat i on 
directives are used. For applications to particular problem classes, 
changing the default vector i zat i on by compiler directives may 
decrease the execution time. 

We list the differences between the subi^outines of Table 1.1 and 
the corresponding Algol procedures of [5] . Some of these 
modifications were suggested in [5] but not implemented in the Algol 
procedures [5] . In subroutine DDELC . the column deleted in A: = QR 

is determined optionally'. Not computing this column sav'es 0(mn) 
arithmetic operations. In subroutine DDELR . the auxiliary' subroutine 
DORTHX is used instead of DORTHO . As indicated in Section 2.1 the 
former subi^outine may' perform nearly' twice as fast. In subroutine 
DINSC, a column w is inserted into A: = QR only' if the reciprocal 
condition number of the matrix [Q*w/||w||] is larger than a bound given 
by the parameter RCOND on entry'. The parameter RCOND can be used to 
prev'ent updating when w/||w|| is nearly in the range of Q. Finally'. 
DRNKl performs slightly' faster if tiie updated matrix A + vu~'' is such 
that V lies numerically' in the range of A. 

The subroutine DRRPM implements an algorithm presented by' Chan 
[3]. The computation of an approximate right singular vector 
corresponding to a least singular value is done by' subroutine DINAIT 
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and has al read>^ been discussed. The posit, ion oT a component oT 
lai^gest magnitude oT this singular vector has to be determined, and 
we Tound , in agreement with Chants suggestion [3] , that two inverse 
iterations suTTice. In Tact, in all computed examples, one inverse 
iteration was suTTicient, even Tor problems with multiple or close 
least singulai^ values. The subroutine permutes the order oT columns 
1 through k oT AH where k is an input parameter, A G ^ m > n, and 

n i s a permutation matrix. DRRPM is typically called with 
k = n,n-l,n-2,... until no Turther per mu tat ion is made or unt i 1 the 
computed upper bound DELTA Tor the least singular value oT the matrix 
consisting oT the T i rst k columns oT AH is not small. 

The subroutines oT Table 1.1 do neither require nor produce a 
Tactor i zat i on with nonnegative diagonal elements oT the upper 
triangular matrix. 

3. THE BLAS 

Much computational experience on a variety oT computers led 
Dongarra and Sorensen [7] to conclude that nearly optimal perTormance 
oT numerical linear algebra subroutines can be achieved iT the sub- 
routines Tor the basic matrix and vector operations, such as multi- 
plication, addition and inner product computation, are written to 
perTorm well on vector computers. We wanted to write a code that 
perTorms well on an IBM 3090-200VF computer, and that would not 
require excessive tuning when moved to other (vector) computers. 
ThereTore we designed the code to vectorize well without special 
compiler instructions, since the latter would be machine dependent. 
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A -feature o-f the VS FOrmiAN 2.3.0 compiler i that unnecess-ai'y 
vector loads and stores are avoided by introducing a tern[)orary scalar 
v'ariable, denoted by ACC in the subroutine DAP.X in Example .3.1. 

During execution ACC should be thouglit o-f a.s a vector variable stored 
in a vector register. Timings -for DAPX and comparison with code with 
explicitly' unrolled loops hav'e been carried out by Robert arid 

Squazzero [16] . These timings show subroutine DAP.X to perform better 
than equivalent subroutines with explicitly unrolled loops. 



Examp 1 e 3.1. 



C 

C 

C 



C 

C 

C 



20 



10 



Subroutine -for matrix vector multiplication. 
SUBROUTINE DAP.X (A , EDA . M . N . X . Y) 

DAPX COMPUTES Y:=A*.X. 

INTEGER LDA,M.N.I.J 

REALMS A(LDA.N) ,.X(N) ,Y(M) ,ACC 

OUTER LOOP VECTORIZES. 

DO 10 1=1. M 
ACC=0D0 
DO 20 J=1 ,N 

ACC=ACC+A(I ..1)*X(J) 

CONTINUE 
Y ( I ) =ACC 
CONTINUE 
RETURN 

END □ 



Temporary scalar variables have also been used in others o"f the 17 
BLAS used . 



4. COMPUTED EXAMPLES 

Example 4.1. In this e.xample the QR decomposition of a 4 x 3 matri.x 
A is updated . The u.se of all sub rout ines of Table 1.1 is 
illustrated. The main program producing this output is listed in the 
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Example 4.2. Execution times ^or subroutines DDELCO and DHNKl are 
compared for scalar and vector aritiimetic. The measured cpu times 
differed somewhat between diTTerent executions oT the same code. 
ThereTore the reported times are rounded to one signiTicant digit and 
the quotient oT measured cpu times are rounded to the nearest 
multiple of 1/2. 

Table 4.1 shows the cpu times for DDELCO. This routine and its 
subroutines have been compiled with the VS FORTRAN 2.3.0 compiler. 

The times for vector arithmetic are obtained from code generated with 
compiler option vlev = 2, which makes the compiler generate code that 
utilizes the vector registers and arithmetic. The times for scalar 
arithmetic are obtained from code generated with compiler option vlev 
= 0, which makes the compiler generate code that does not use vector 
instructions. Given a QR decomposition of a matrix A G , Table 

4.1 shows the cpu time required by DDELCO to compute the QR 
decomposition of A G obtained by deleting column one of A. 



m 


n 


scalar arithmetic 


vector arithmetic 


scalar time 
vector t i me 


10 


10 


4-10“* 


4-10“’ 


1 


20 


10 


4-10"'’ 


4-10''’ 


1 


30 


10 


5-10“^ 


4-10“’ 


1 .5 


50 


10 


6-10"’ 


4-10“’ 


1 .5 


75 


10 


8-10"’ 


4-10'‘^ 


2 


128 


10 


I-IO'^ 


5-10"’ 


2 


1024 


10 


7- 10'^ 


210'^ 


3.5 


1280 


10 


9-10'^ 


310'^ 


3.5 



Table 4.1: Timings for DDELCO 
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Table 4.2 is similar to Table 4.1 and contains execution times Tor 
DRNKl . The reduction in execution time obtained by using vector 
instructions is oT the same order of magnitude for the other updating 
routines, too. 

cpu time in seconds 



m 


n 


seal 


ar arithmetic vector arithmetic 


seal ar 
vector 


time 
t i me 


16 


12 




l-lO'^ 


l-lO'^ 


1 




32 


25 




4-10'^ 


3-10*^ 


1 .5 




64 


50 




2-10'^ 


7- 10'^ 


2 




128 


100 




6-10'^ 


210'^ 


2.5 




1024 


100 




4-10‘^ 


8-10'2 


4.5 




1250 


100 




510‘^ 


910'^ 


5 










Table 4.2: 


Timings for DRNKl 




□ 


Exampl 


e 4.3. 


Execution times for s 


ubroutines written 


by Buckl 


ey [2] 


and th 


ose 


Table 


1 . 1 are compared 


. The vectorized and scalar code 



were generated as explained in Example 4.2. We found that 
vector i zat i on of the subroutines in [2] did not change the execution 
times significantly, generally less than 20%. In all computed 
examples the vectorized subroutines in [2] required at least twice as 
much execution time than the vectorized subroutines of Table 1.1. 

For certain problems our vectorized code executed up to 95 times 
faster than the vectorized code in [2] . For scalar code the 
differences in execution time often decreased with increasing matrix 
size. Tables 4. 3-4. 6 present some sample timings. 
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L i me for 


L i me f 


m 


DDELR 


DELROW 


10 


3-10'^ 


9-10''’ 


64 


7-10‘^ 


2- 10'^ 


128 


S-IO'"* 


3-10’^ 


1024 


4-10'^ 


2- 10'^ 



^itne for DELHOW [2] 
■Lime for DDICLH 

3.510^ 

2 - 10 ^ 

3 

4 



Table 4*3: The first, row of A = QR is deleted. Cpu times for 

vectorized code for updating Q and R are given in 

seconds; n = 10. 



m 


n 


L i me for 

DELC 


Lime for 
DELCOL [2] 


time for DELCOL [2] 
time for DDELC 


1024 


10 


I-IO'"* 


2-10"'' 


2 


1024 


100 


1 ■ 1 O'"* 


7-10'^ 


7.5-10^ 


1280 


100 


I-IO'" 


9-10'^ 


9.5-10^ 



Table 4.4: The last column of A = QR is deleted. Cpu times for 

vectorized code for updating Q and R are given in 
seconds. DDELC does not compute the last column of 

A , i . e . , I FLAG = 0 on e nt ry . 



time for 



t i me for 



time for INSCOL [2] 
time for DINSC 



m 


DINSC 


INSCOL [2] 




64 


I-IO'^ 


2- 10'^ 


2 


128 


I-IO'^ 


3-10'^ 


2 


1024 


7-10'^ 


2- 10'^ 


2.5 



Table 4.5: A new first column is inserted into A = QR . Cpu times 

for vectorized code for updating Q and R are gi\^en in 

seconds: n = 10. 



Tables 4. 3-4. 5 present timings for vectorized code. The next table 
shows timings for scalar code for the same updatings as in Table 4.3. 
Table 4.6 shows that, without ve ctor i zat i on , DELROW [2] requires 50^ 
more cpu time than DDELR for moderately large problems. 
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t:ime -for 



time for 



time for DELRPW 
time for DDELR 



m 


DDELR 


DELROW [2] 




10 


210'^ 


e-io'"* 


3-10^ 


64 


llO'^ 


2- 10'^ 


1.5 


128 


2-1 0'^ 


310'^ 


1 .5 


1024 


I-IO'^ 


2- 10'^ 


1 .5 



[ 2 ] 



Table 4.6: The first row of A = QR is deleted. Cpu times for 

scalar code for updating Q and R are given in seconds; 
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